Where do most of the players in FIFA 2018 come from? Is it South America or Europe? What is the most common age of the players listed in FIFA 2018? What is the age range of players? What is the distribution of their performance? These are the questions I would like to answer through Exploratory Data Analysis. I will use the ggplot2 library that I learnt in the lesson, coupled with plotly for interactive visualization.
The dataset features every player in FIFA 2018 with 70+ attributes. It contains personal attributes like Nationality, Photo, Club, Age, Wage, Salary etc. I downloaded the dataset from https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset.
The dataset is tidy except for a few columns: Wage, Value and Preferred.Positions. I will extract the numeric values from the Wage and Value columns, and pull out the most preferred position from the Preferred.Positions column, on the assumption that the positions are listed in order of preference.
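The cleaning described above could be sketched roughly as follows. This is a minimal illustration, not the code used for this report: the helper names `parse_money_k` and `first_position` are hypothetical, and the raw string formats (a currency symbol followed by a number and a `K` or `M` suffix, and positions separated by spaces) are assumptions.

```r
# Hypothetical helper: convert strings such as "€95.5M" or "€565K" to a
# numeric amount in thousands. The raw string format is an assumption.
parse_money_k <- function(x) {
  # Keep only digits and the decimal point, then convert to numeric
  num <- suppressWarnings(as.numeric(gsub("[^0-9.]", "", x)))
  # Scale by the trailing unit letter: M = millions, K = thousands.
  # In this sketch, values without a K or M suffix become NA.
  mult <- ifelse(grepl("M$", x), 1000, ifelse(grepl("K$", x), 1, NA_real_))
  num * mult
}

# Hypothetical helper: take the first entry of a space-separated position list
first_position <- function(x) {
  vapply(strsplit(trimws(x), "\\s+"), function(p) p[1], character(1))
}

# Toy values for illustration only (not taken from the dataset)
parse_money_k(c("€95.5M", "€565K"))
first_position(c("ST LW ", "CDM CM "))
```

Treating unit-less values as NA is one possible design choice; depending on how the raw column is encoded, mapping them to 0 might be more appropriate.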
## Name Age
## J. Rodríguez: 7 Min. :16.00
## J. Valencia : 7 1st Qu.:21.00
## J. Williams : 7 Median :25.00
## D. González : 6 Mean :25.14
## Danilo : 6 3rd Qu.:28.00
## Felipe : 6 Max. :47.00
## (Other) :17942
## Photo
## https://cdn.sofifa.org/48/18/players/197083.png: 2
## https://cdn.sofifa.org/48/18/players/198113.png: 2
## https://cdn.sofifa.org/48/18/players/198140.png: 2
## https://cdn.sofifa.org/48/18/players/198329.png: 2
## https://cdn.sofifa.org/48/18/players/198584.png: 2
## https://cdn.sofifa.org/48/18/players/198614.png: 2
## (Other) :17969
## Nationality Overall Potential
## Length:17981 Min. :46.00 Min. :46.00
## Class :character 1st Qu.:62.00 1st Qu.:67.00
## Mode :character Median :66.00 Median :71.00
## Mean :66.25 Mean :71.19
## 3rd Qu.:71.00 3rd Qu.:75.00
## Max. :94.00 Max. :94.00
##
## Club Value Wage
## : 248 Length:17981 Length:17981
## Villarreal CF : 35 Class :character Class :character
## Borussia Dortmund: 34 Mode :character Mode :character
## FC Nantes : 34
## Manchester United: 34
## OGC Nice : 34
## (Other) :17562
## Preferred.Positions Continent
## Length:17981 Length:17981
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Age ranges from 16 to 47 years, with a mean of 25.14 and a median of 25. Since the mean and median nearly coincide, I suspect Age is roughly normally distributed. I will plot a histogram in the univariate plots section to see if this is the case.
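A histogram of this kind could be drawn with ggplot2 along the following lines. This is only a sketch: the data frame name `players` is hypothetical, and the ages are randomly generated toy values rather than the real column.

```r
library(ggplot2)

# Toy stand-in for the real data frame; `players` is a hypothetical name
set.seed(1)
players <- data.frame(Age = sample(16:47, 500, replace = TRUE))

# One bar per year of age; binwidth = 1 keeps individual ages visible
age_hist <- ggplot(players, aes(x = Age)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Age (years)", y = "Number of players")

age_hist
```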
Looking at the Nationality column, the top 5 countries are all from either Europe or South America. In the univariate plots section I will group by Nationality and plot the counts on a map to visualize the distribution of players by country.
The Overall and Potential columns both range from 46 to 94, with means of about 66 and 71 respectively. The 5 point difference in means makes me wonder how many players have scope for improvement. I would like to explore the difference between the two columns in the plot section below. I expect these two columns to be heavily correlated.
No surprises here either: most players have an Overall score close to 66.
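One quick way to check the expected correlation is base R's `cor()`. The sketch below uses a handful of made-up ratings purely for illustration; the data frame name `ratings` is hypothetical.

```r
# Toy ratings for illustration only; the real columns are Overall and Potential
ratings <- data.frame(
  Overall   = c(60, 66, 71, 75, 82),
  Potential = c(70, 71, 74, 75, 84)
)

# Pearson correlation; values near 1 would indicate a strong linear relationship
r <- cor(ratings$Overall, ratings$Potential)
r
```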
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.00 14.00 21.00 31.95 36.00 565.00 12727
The Wage variable has a lot of NAs: 12,727 of the 17,981 values (about 71%) are missing. I will discard this variable from any further analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10 300 625 2252 1600 123000 1548
The Value variable is very intriguing. The median value is 625K, meaning half the players are valued below 625K and half above it. The 3rd quartile is 1.6M and the maximum value is 123M. In fact, I expected this kind of skew, because most players are not valued in the millions, but I would like to explore the high-valued players further.
## # A tibble: 6 x 11
## Nationality mean_Overall max_Overall mean_Potential max_Potential
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United Kingdom 63.1 89.0 69.9 90.0
## 2 Germany 65.9 92.0 71.6 92.0
## 3 Spain 69.9 90.0 74.8 92.0
## 4 France 67.3 88.0 73.0 94.0
## 5 Argentina 67.8 93.0 72.5 93.0
## 6 Brazil 70.9 92.0 72.9 94.0
## # ... with 6 more variables: mean_Age <dbl>, mean_Diff <dbl>,
## # max_Diff <dbl>, mean_Value <dbl>, max_Value <dbl>, n <int>
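A per-country summary with columns like those above could be built with dplyr roughly as follows. This is a sketch on toy rows, not the report's actual code; the data frame names `players` and `by_nation` are hypothetical, and only a subset of the summary columns is shown.

```r
library(dplyr)

# Toy rows for illustration only (not taken from the dataset)
players <- data.frame(
  Nationality = c("Spain", "Spain", "Brazil", "Brazil", "Brazil"),
  Overall     = c(70, 80, 65, 75, 85),
  Potential   = c(75, 82, 70, 80, 85)
)

# One row per country, sorted by number of players
by_nation <- players %>%
  group_by(Nationality) %>%
  summarise(
    mean_Overall   = mean(Overall),
    max_Overall    = max(Overall),
    mean_Potential = mean(Potential),
    max_Potential  = max(Potential),
    n              = n()
  ) %>%
  arrange(desc(n))

by_nation
```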
Clearly the reddest regions are in South America and Europe. The UK has the highest number of players. Most of Asia and Africa is grey, meaning fewer than 60 players come from each of those countries. In the Middle East there is a stark contrast between nations: Saudi Arabia is much redder than its neighbours. The surprising observations are Canada and New Zealand: both are high income countries but are grey, so perhaps population affects the number of players from a country.
After exploring the dataset across various variables, I have the following conclusions:
The dataset is tidy. Apart from a few changes, such as extracting numbers from a variable, I don’t need to make any more changes.
The most interesting features in the dataset are Nationality, Age, Potential, Overall, Value, and Preferred.Positions. A brief description of the features is as follows:
1. Nationality: Nationality of the player.
2. Age: Age of the player.
3. Potential: The potential of the player.
4. Overall: The current overall standing of the player.
5. Value: The player's value in thousands of pounds.
6. Preferred.Positions: The preferred position of the player.
Other features like Continent might be helpful; I will explore whether they are.
I created a variable PO_Diff, which is the difference between Potential and Overall. I also created a variable Continent.
I extracted the numerical values from the Wage and Value variables. Further, I pulled out the most preferred position from the Preferred.Positions variable.
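As a minimal sketch of the PO_Diff variable, assuming it is simply Potential minus Overall (the data frame name `players` and the toy values are hypothetical):

```r
# Toy rows for illustration only (not taken from the dataset)
players <- data.frame(
  Overall   = c(60, 66, 82),
  Potential = c(70, 71, 82)
)

# Room for improvement: how far Potential sits above the current Overall
players$PO_Diff <- players$Potential - players$Overall

# Number of players whose Potential exceeds their current Overall
sum(players$PO_Diff > 0)
```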
From the above curve, I see meaningful correlation between the following:
The same story holds for Overall: Value does increase with Overall score, but there are some players with a high Overall score and a low Value. Why are they undervalued? I will explore this further in the multivariate plots.